Build a Large-Scale Syntactically Annotated Chinese Corpus

Author

  • Qiang Zhou
Abstract

This paper reports on our research to build the large-scale Tsinghua Chinese Treebank (TCT). We propose a two-stage approach that reduces manual proofreading labor as much as possible. Inserting an intermediate functional chunk level creates an information bridge that links simple chunk annotation with detailed syntactic tree annotation. We describe our chunk and tree annotation schemes, focusing on two grammatical relation tag sets designed to describe in more detail most of the special phenomena of the Chinese language. We also briefly introduce our current progress in building a Chinese chunk bank of 2,000,000 Chinese characters, developing an efficient chunk-based Chinese parser, and building a 1,000,000-word Chinese treebank. This work lays a good foundation for a further research project to build a good Chinese parser.

Similar resources

A Large-Scale Japanese CFG Derived from a Syntactically Annotated Corpus and Its Evaluation

Although large-scale grammars are a prerequisite for parsing a great variety of sentences, it is difficult to build such grammars by hand. Yet, it is possible to build a context-free grammar (CFG) by deriving it from a syntactically annotated corpus. Many such corpora have been built recently to obtain statistical information concerning corpus-based NLP technologies. For English, it is well known...

Evaluation of a Japanese CFG Derived from a Syntactically Annotated Corpus with Respect to Dependency Measures

Parsing is one of the important processes for natural language processing and, in general, a large-scale CFG is used to parse a wide variety of sentences. For many languages, a CFG is derived from a large-scale syntactically annotated corpus, and many parsing algorithms using CFGs have been proposed. However, we could not apply them to Japanese since a Japanese syntactically annotated corpus ha...

Huge Parsed Corpora in LASSY

One of the goals of the LASSY STEVIN project (Large Scale Syntactic Annotation of written Dutch) is a syntactically annotated (manually verified) corpus of 1 million words. In addition, the full STEVIN reference corpus of 500 million words will be syntactically annotated automatically. In this paper, the potential of such huge treebanks for applications in corpus linguistics, natural language p...

CASIA-CASSIL: a Chinese Telephone Conversation Corpus in Real Scenarios with Multi-leveled Annotation

CASIA-CASSIL is a large-scale corpus of naturally occurring Chinese human-human telephone conversations in restricted domains. The first edition consists of 792 90-second conversations belonging to the tourism domain, which are selected from 7,639 spontaneous telephone recordings in real scenarios. The corpus is now being annotated with a wide range of linguistic and paralinguistic information i...

Building a Large-Scale Japanese CFG for Syntactic Parsing

Large-scale grammars are a prerequisite for parsing a great variety of sentences, but it is difficult to build such grammars by hand. Yet, it is possible to derive a context-free grammar (CFG) automatically from an existing large-scale, syntactically annotated corpus. While being seemingly a simple task at first sight, CFGs derived in such a fashion have hardly ever been applied to an existing s...

Publication date: 2003